Skip to content

OCPBUGS-53408: wait for build and ensure OS image is actually new #4924

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

cheesesashimi
Copy link
Member

@cheesesashimi cheesesashimi commented Mar 18, 2025

- What I did

Occasionally, there is a delay between the time that a new rendered MachineConfig is produced and OCL begins a build. However, a couple of things happen in the interim:

  • The RenderController updates the MachineConfigPool. Because of the delay mentioned above, the NodeController begins updating all of the nodes with only the new rendered MachineConfig. The OS image remains the same because the NodeController is not ensuring that the image pullspec on the MachineOSConfig is the same as the MachineOSBuild.
  • Because of the work done to the MCD in OCPBUGS-43896: add revert logic to OCL path in MCD #4825, the original check that we had to determine whether the image pullspecs were the same is no longer present. Additionally, the logic change there makes it possible for an OS update to always occur whenever OCL is enabled, further bypassing that check.

This fixes the behavior by making the following changes.

  1. Update the Node Controller to ensure that the both the MachineOSBuild's MachineConfig reference matches the MCP's current rendered MachineConfig while also checking that the MachineOSConfig's image pullspec matches the MachineOSBuild's. In the situation where the MachineOSBuild's pullspec is empty, this check will fail and the Node Controller will requeue.
  2. Update the MCD so that even when OCL is enabled, if the OS images are the same, the OS update process is skipped.

- How to verify it

This bug is difficult to reproduce. However, it seems to occur when multiple OCL-enabled pools are updating simultaneously.

- Description for the changelog
Wait for build and ensure OS image is actually new

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2025
Copy link
Contributor

openshift-ci bot commented Mar 18, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch 2 times, most recently from cab64f7 to 8d2ede1 Compare March 21, 2025 16:24
@cheesesashimi cheesesashimi changed the title check for image pullspec and don't consider OCL for OS updates OCPBUGS-53408: wait for build and ensure OS image is actually new Mar 21, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 21, 2025
@openshift-ci-robot
Copy link
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-53408, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cheesesashimi
Copy link
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 21, 2025
@cheesesashimi cheesesashimi marked this pull request as ready for review March 21, 2025 16:29
@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Mar 21, 2025
@openshift-ci-robot
Copy link
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-53408, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch from 8d2ede1 to e121461 Compare March 21, 2025 19:51
@cheesesashimi
Copy link
Member Author

/retest required

Copy link
Contributor

openshift-ci bot commented Mar 26, 2025

@cheesesashimi: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test e2e-gcp-op-single-node
/test e2e-hypershift
/test images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test bootstrap-unit
/test e2e-agent-compact-ipv4
/test e2e-aws-disruptive
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-aws-ovn-ocb-techpreview
/test e2e-aws-ovn-upgrade-ocb-techpreview
/test e2e-aws-ovn-upgrade-out-of-change
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-ovn-upgrade-out-of-change
/test e2e-azure-upgrade
/test e2e-gcp-op-ocl
/test e2e-gcp-op-techpreview
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-dualstack
/test e2e-openstack-externallb
/test e2e-openstack-hypershift
/test e2e-openstack-parallel
/test e2e-openstack-singlestackv6
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-ovn-upi
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-upgrade
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-images
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-main-bootstrap-unit
pull-ci-openshift-machine-config-operator-main-e2e-agent-compact-ipv4
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-main-e2e-azure-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-ocl
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-single-node
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-techpreview
pull-ci-openshift-machine-config-operator-main-e2e-hypershift
pull-ci-openshift-machine-config-operator-main-images
pull-ci-openshift-machine-config-operator-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-main-security
pull-ci-openshift-machine-config-operator-main-unit
pull-ci-openshift-machine-config-operator-main-verify

In response to this:

/retest required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ptalgulk01
Copy link

@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch 7 times, most recently from b398da2 to c805f52 Compare April 2, 2025 13:45
@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 7, 2025
@openshift-ci-robot
Copy link
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-53408, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

- What I did

Occasionally, there is a delay between the time that a new rendered MachineConfig is produced and OCL begins a build. However, a couple of things happen in the interim:

  • The RenderController updates the MachineConfigPool. Because of the delay mentioned above, the NodeController begins updating all of the nodes with only the new rendered MachineConfig. The OS image remains the same because the NodeController is not ensuring that the image pullspec on the MachineOSConfig is the same as the MachineOSBuild.
  • Because of the work done to the MCD in OCPBUGS-43896: add revert logic to OCL path in MCD #4825, the original check that we had to determine whether the image pullspecs were the same is no longer present. Additionally, the logic change there makes it possible for an OS update to always occur whenever OCL is enabled, further bypassing that check.

This fixes the behavior by making the following changes.

  1. Update the Node Controller to ensure that the both the MachineOSBuild's MachineConfig reference matches the MCP's current rendered MachineConfig while also checking that the MachineOSConfig's image pullspec matches the MachineOSBuild's. In the situation where the MachineOSBuild's pullspec is empty, this check will fail and the Node Controller will requeue.
  2. Update the MCD so that even when OCL is enabled, if the OS images are the same, the OS update process is skipped.

- How to verify it

This bug is difficult to reproduce. However, it seems to occur when multiple OCL-enabled pools are updating simultaneously.

- Description for the changelog
Wait for build and ensure OS image is actually new

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment on lines +911 to +914
// TODO: We should use the selectors from the build controller since they are
// well-tested and makes querying for this information significantly easier.
// Additionally, this should use listers instead of API clients in order to
// reduce the impact on the API server.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Make a Jira card to track this work todo.

Comment on lines 103 to 104
// TOOD: Handle this error better.
klog.Warningf("Error when checking isLayeredPool: %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Make a Jira card to track this work todo.

Copy link
Member

@isabella-janssen isabella-janssen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

/hold
Holding so you can decide if you want to address the todo comments before this merges.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 8, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2025
@cheesesashimi
Copy link
Member Author

/retest-required

@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch from c805f52 to 51972a7 Compare April 10, 2025 21:21
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 10, 2025
Occasionally, there is a delay between the time that a new rendered
MachineConfig is produced and OCL begins a build. However, a couple of
things happen in the interim:

- The RenderController updates the MachineConfigPool. Because of the
  delay mentioned above, the NodeController begins updating all of the
  nodes with only the new rendered MachineConfig. The OS image remains
  the same because the NodeController is not ensuring that the image
  pullspec on the MachineOSConfig is the same as the MachineOSBuild.
- Because of the work done to the MCD in
  openshift#4825, the
  original check that we had to determine whether the image pullspecs
  were the same is no longer present. Additionally, the logic change
  there makes it possible for an OS update to always occur whenever OCL
  is enabled, further bypassing that check.

This fixes that by doing two things:

1. Update the Node Controller to ensure that the both the
   MachineOSBuild's MachineConfig reference matches the MCP's current
   rendered MachineConfig while also checking that the MachineOSConfig's
   image pullspec matches the MachineOSBuild's. In the situation where
   the MachineOSBuild's pullspec is empty, this check will fail and the
   Node Controller will requeue.
2. Update the MCD so that even when OCL is enabled, if the OS images are
   the same, the OS update process is skipped.
@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch from 51972a7 to 57eee4a Compare April 10, 2025 21:23
@isabella-janssen
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 10, 2025
Copy link
Contributor

openshift-ci bot commented Apr 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cheesesashimi,isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

l, err := ctrl.isLayeredPool(mosc, mosb)
if err != nil {
// TODO: Handle this error better.
klog.Warningf("Error when checking isLayeredPool: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(question): I've checkout out the code and dived into this part and I was surprissed we must be dealing with an error that can only happen if this happens.
I was thinking about handling FeatureGates at a component level (mcd, mcs, controller), like, creating a dedicated struct that internally has a map of all the observed gates that can expose a method to check if the feature gate is enabled or not based on our rules and with centralized error handling. That way, if for example, the gates are not yet loaded, we can emit, in a single place, a log or an event. WDYT?
I'm proposing this not just because of your change, but because I see a similar pattern each time we need to check for a FeatureGate

@cheesesashimi
Copy link
Member Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 11, 2025
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 88b9bb6 and 2 for PR HEAD 57eee4a in total

Copy link
Contributor

openshift-ci bot commented Apr 11, 2025

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/bootstrap-unit 57eee4a link false /test bootstrap-unit
ci/prow/okd-scos-e2e-aws-ovn 57eee4a link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit d5e8466 into openshift:main Apr 11, 2025
15 of 17 checks passed
@openshift-ci-robot
Copy link
Contributor

@cheesesashimi: Jira Issue OCPBUGS-53408: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-53408 has been moved to the MODIFIED state.

In response to this:

- What I did

Occasionally, there is a delay between the time that a new rendered MachineConfig is produced and OCL begins a build. However, a couple of things happen in the interim:

  • The RenderController updates the MachineConfigPool. Because of the delay mentioned above, the NodeController begins updating all of the nodes with only the new rendered MachineConfig. The OS image remains the same because the NodeController is not ensuring that the image pullspec on the MachineOSConfig is the same as the MachineOSBuild.
  • Because of the work done to the MCD in OCPBUGS-43896: add revert logic to OCL path in MCD #4825, the original check that we had to determine whether the image pullspecs were the same is no longer present. Additionally, the logic change there makes it possible for an OS update to always occur whenever OCL is enabled, further bypassing that check.

This fixes the behavior by making the following changes.

  1. Update the Node Controller to ensure that the both the MachineOSBuild's MachineConfig reference matches the MCP's current rendered MachineConfig while also checking that the MachineOSConfig's image pullspec matches the MachineOSBuild's. In the situation where the MachineOSBuild's pullspec is empty, this check will fail and the Node Controller will requeue.
  2. Update the MCD so that even when OCL is enabled, if the OS images are the same, the OS update process is skipped.

- How to verify it

This bug is difficult to reproduce. However, it seems to occur when multiple OCL-enabled pools are updating simultaneously.

- Description for the changelog
Wait for build and ensure OS image is actually new

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.19.0-202504120239.p0.gd5e8466.assembly.stream.el9.
All builds following this will include this PR.

cheesesashimi pushed a commit to cheesesashimi/machine-config-operator that referenced this pull request May 1, 2025
…-and-node-controller

OCPBUGS-53408: wait for build and ensure OS image is actually new
cheesesashimi pushed a commit to cheesesashimi/machine-config-operator that referenced this pull request May 2, 2025
…-and-node-controller

OCPBUGS-53408: wait for build and ensure OS image is actually new
cheesesashimi pushed a commit to cheesesashimi/machine-config-operator that referenced this pull request May 21, 2025
…-and-node-controller

OCPBUGS-53408: wait for build and ensure OS image is actually new
cheesesashimi pushed a commit to cheesesashimi/machine-config-operator that referenced this pull request May 22, 2025
…-and-node-controller

OCPBUGS-53408: wait for build and ensure OS image is actually new
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants